HybridDBRpred: improved sequence-based prediction of DNA-binding amino acids using annotations from structured complexes and disordered proteins

Abstract Current predictors of DNA-binding residues (DBRs) from protein sequences belong to two distinct groups: those trained on binding annotations extracted from structured protein-DNA complexes (structure-trained) and those trained on annotations from intrinsically disordered proteins (disorder-trained). We complete the first empirical analysis of predictive performance across the structure- and disorder-annotated proteins for a representative collection of ten predictors. The majority of the structure-trained tools perform well on the structure-annotated proteins while doing relatively poorly on the disorder-annotated proteins, and vice versa. Several methods make accurate predictions for the structure-annotated proteins or the disorder-annotated proteins, but none performs highly accurately for both annotation types. Moreover, most predictors make excessive cross-predictions for the disorder-annotated proteins, where residues that interact with non-DNA ligand types are predicted as DBRs. Motivated by these results, we design, validate and deploy an innovative meta-model, hybridDBRpred, that uses a deep transformer network to combine predictions generated by the three best current predictors. HybridDBRpred provides accurate predictions and low levels of cross-predictions across the two annotation types, and is statistically more accurate than each of the ten tools and baseline meta-predictors that rely on averaging and logistic regression. We deploy hybridDBRpred as a convenient web server at http://biomine.cs.vcu.edu/servers/hybridDBRpred/ and provide the corresponding source code at https://github.com/jianzhang-xynu/hybridDBRpred.


Introduction
Protein-DNA interactions are central to many cellular functions including transcription, gene regulation, DNA repair, and chromatin remodelling (1,2). They are annotated and studied using a variety of experimental methods, such as affinity purification, electrophoresis mobility shift assays, chromatin immunoprecipitation, CRISPR-Cas9-based techniques and atomic force microscopy (3–5). Molecular-level details are learned using X-ray crystallography, electron microscopy, and nuclear magnetic resonance, with around 7000 structures of protein-DNA complexes in PDB (6). However, these techniques do not keep up with the rapid accumulation of protein and DNA sequence data (7,8), motivating the development and use of fast computational predictors of protein-DNA interactions from protein sequences (9–14) and DNA sequences (15,16). These methods are developed using a limited amount of experimental data and can be applied to predict interactions in a high-throughput manner for uncharacterized sequences. The DNA sequence-based predictors were summarized and compared in a recent survey (15), while a similar analysis is lacking for the protein sequence-based tools.
Based on their training datasets, the sequence-based predictors of DBRs can be divided into two distinct groups: structure-trained predictors and intrinsic disorder-trained predictors. The former group uses training datasets where annotations of DBRs are extracted from structures of protein-DNA complexes, typically using data from the PDB (6,60) and the PDB-derived BioLip (61,62) databases. The latter group utilizes training datasets collected from the DisProt database (63), where DBRs are located in the intrinsically disordered regions (IDRs). IDRs are segments in a protein sequence that do not have a stable three-dimensional structure under physiological conditions (64–66) and are especially abundant in eukaryotes (67,68). DBRs in IDRs differ from structured DBRs in several ways. They can interact with several different ligands by folding into different conformations, are enriched in disorder-promoting amino acids and have a larger surface area (69–71). Furthermore, the importance of intrinsic disorder in the context of protein-DNA interactions was demonstrated in numerous studies (72–75). We identify two intrinsic disorder-trained predictors of DBRs, DisoRDPbind and DeepDISOBind. This low number can be explained by the fact that the corresponding experimental annotations were introduced relatively recently (76). The remaining 32 methods are structure-trained, and none of the 34 predictors is trained using annotations that span the structured and disordered states. The substantial differences between the structured and disordered states suggest that the current predictors may provide poor results for the other type of annotations. This claim is supported by recent studies that empirically found that structure-trained (disorder-trained) predictors of protein-binding residues and RNA-binding residues provide inaccurate predictions for the disorder-annotated (structure-annotated) proteins (77,78). However, the current structure-trained predictors of DBRs were never assessed on the disorder-annotated proteins and vice versa. Furthermore, recent works identify a cross-prediction problem where amino acids interacting with a given partner type are cross-predicted as interacting with different partner types, resulting in partner-agnostic predictions (49,53,56,77,79,80). In our case, this means that amino acids interacting with non-DNA partners (e.g. proteins and RNA) are predicted as DBRs. This may happen because sequence-based predictors of DBRs are typically trained on datasets composed of DNA-binding proteins, with few to no proteins that interact with the non-DNA partners. Thus, they might not be able to differentiate between different ligand types. While a few recent predictors of DBRs, including DRNApred, NCBRPred, and DisoRDPbind, were designed to reduce cross-predictions, a broad study that investigates this aspect is also missing.
Table 1 summarizes surveys that discuss predictors of protein-DNA interactions from protein sequences to examine whether the literature already covers the above-mentioned aspects. The five reviews consider between 8 and 14 sequence-based predictors of DBRs and provide insightful information about their models, inputs and datasets that they utilize (9–14). Three reviews perform comparative analysis, but they cover a rather narrow subset of predictors of DBRs. This is because they focus on a broader spectrum of sequence- and structure-based predictors of DNA, RNA, and protein binding residues (10,11,13). Importantly, the five surveys do not discuss the disorder-trained predictors and recent methods that were published after 2018, do not evaluate the predictors on the disorder-annotated interactions, and do not investigate the cross-predictions. We study a substantially larger number of predictors, including nine recently published methods, and we address the open questions regarding predictive performance of the disorder-trained vs. structure-trained predictors. We empirically evaluate a representative set of ten sequence-based methods that include five structure-trained predictors of DBRs, both disorder-trained predictors of DBRs, and three disorder-trained methods that predict interactions

Selection of predictors
We select a representative subset of the sequence-based predictors of DBRs. We focus on methods that are relatively recent, publicly available and sufficiently fast to predict a large dataset. More specifically, they must satisfy the following criteria: (i) published in or after 2010; (ii) the server or code was publicly available and functional when we collected predictions; (iii) output predictions for an average-length sequence (300 amino acids) in <10 min; (iv) output real-valued propensities for DNA binding and binary predictions for each amino acid, so their results can be evaluated with commonly used metrics. Consequently, we select five structure-trained methods: BindN+ (37), TargetS (41), TargetDNA (47), DNAPred (54) and DNAgenie (58); and both disorder-trained methods, DisoRDPbind (43) and DeepDISOBind (59). These methods include the two most recent predictors: the structure-trained DNAgenie and the disorder-trained DeepDISOBind. Using recent results from the CAID assessment (81), we supplement the disorder-trained DisoRDPbind and DeepDISOBind with three well-performing disorder-trained methods that satisfy the criteria and predict disordered binding residues: fMoRFpred (82), ANCHOR2 (83) and MoRFCHiBi (84). While these three methods were not originally designed to predict DBRs, we investigate whether they can be used for this purpose.

Datasets
We train and test the hybridDBRpred method on datasets that cover structure-annotated and disorder-annotated proteins, and which include a sufficiently large number of residues that interact with the other (non-DNA) ligand types to assess the cross-predictions. They include the training dataset that we use to train a machine-learning model, the validation dataset that we utilize to optimize predictive performance of this model, and the test dataset that we apply to compare performance with the current methods. We follow procedures from related studies to compile these datasets (80,85). Briefly, this means that we use full protein sequences where the binding annotations are mapped across different protein-DNA complexes that share the same protein into the same UniProt sequence using SIFTS (86), increasing their quality and completeness. First, we collect the structure-based annotations of interactions from BioLip (62,87), which in turn processes data from PDB, and the disorder-based annotations from DisProt (63). Next, we cluster the collected proteins together with the combined set of training proteins for the 10 selected predictors (BindN+, TargetS, TargetDNA, DNAPred, DNAgenie, fMoRFpred, DisoRDPbind, ANCHOR2, MoRFCHiBi and DeepDISOBind) at 25% similarity using Blastclust (88). We pick one (the most recently released) protein from each cluster, which ensures that the selected proteins uniformly sample the sequence space and share low similarity. We select test proteins from the clusters that exclude any of the training proteins. Consequently, we collect 39 DNA-binding proteins and 396 proteins that interact with other ligand types, with a 2:1 ratio of structure- vs. disorder-annotated proteins in both protein sets. The test dataset includes 435 proteins and 201 154 residues, with 2940 DBRs (1.5%) and 19 755 amino acids that interact with other ligand types (9.8%).
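The selection of test proteins from the clusters can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `clusters` dictionary, its tuple layout and the function name `pick_test_proteins` are hypothetical, release dates are assumed to be ISO-formatted strings, and the clustering itself (Blastclust at 25% similarity) is assumed to have been run beforehand.

```python
def pick_test_proteins(clusters):
    """Pick one test protein per cluster that contains no training proteins.

    clusters: dict mapping a cluster id to a list of tuples
              (protein_id, release_date, in_predictor_training_set).
              This layout is a hypothetical convenience, not a published format.
    """
    selected = []
    for members in clusters.values():
        # Skip clusters that include a training protein of any selected predictor.
        if any(is_training for _, _, is_training in members):
            continue
        # Keep the most recently released protein as the cluster representative.
        newest = max(members, key=lambda member: member[1])
        selected.append(newest[0])
    return sorted(selected)


example_clusters = {
    "c1": [("P1", "2015-03-01", False), ("P2", "2020-06-15", False)],
    "c2": [("P3", "2019-01-10", True), ("P4", "2021-02-02", False)],
    "c3": [("P5", "2018-11-30", False)],
}
test_proteins = pick_test_proteins(example_clusters)
```

Dropping the whole cluster `c2`, rather than only the training protein `P3`, is what keeps every selected test protein below the clustering similarity cutoff with respect to the training proteins.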
We source the training and validation datasets from clusters that include the training proteins of the 10 selected predictors and which exclude clusters used to pick the test proteins. This means that the training and validation proteins share low (<25%) similarity with the test proteins. We ensure that the validation dataset has similar numbers of the DNA-binding proteins when compared with the test dataset, which means that we select 13 disorder-annotated and 26 structure-annotated DNA-binding proteins into this dataset. We divide the remaining proteins proportionally between the training and validation datasets. Consequently, the training dataset has 591 proteins and 241 284 residues, with 4398 DBRs (1.82%) and 22 030 amino acids that interact with other ligand types (9.13%). The validation dataset includes 267 proteins and 116 244 residues, with 2232 DBRs (1.92%) and 9960 amino acids that interact with other ligand types (8.57%). Supplementary Table S1 provides a detailed breakdown of the three datasets, which are freely available at http://biomine.cs.vcu.edu/servers/hybridDBRpred.

Assessment metrics and statistical analysis
Evaluation is done at the residue level and assesses the quality of the predicted real-valued propensities for DNA binding and of the binary predictions. We evaluate propensities with the commonly used Area Under the ROC Curve (AUC). Moreover, motivated by the fact that DBRs constitute a small fraction of the residues (1.5%) and inspired by past studies (11,49,58,77,80), we also compute AULC (Area Under the Low false positive rate part of the ROC Curve). AULC quantifies AUC for arguably the most useful part of the curve where the number of predicted DBRs does not exceed the actual number of DBRs. Since AULC is a relatively small number, we compute AULCratio that divides AULC of a given method by the AULC of a random predictor. This way, AULCratio = 1 indicates a prediction that is equivalent to a random result while higher AULCratio quantifies the rate of improvement over the random result, e.g. AULCratio equals 2 when a given result is twice as good as that of a random predictor. We evaluate the binary predictions (binds DNA versus does not bind DNA) using several complementary metrics (80), including sensitivity (true positive rate, TPR), specificity, false positive rate (FPR) and F1. These metrics are computed from TP, TN, FP and FN, which indicate the number of true positives (correctly predicted DBRs), true negatives (correctly predicted non-DBRs), false positives (non-DBRs incorrectly predicted as DBRs), and false negatives (DBRs incorrectly predicted as non-DBRs), respectively. We derive binary predictions from the propensities using a threshold, where residues with propensities > threshold are assumed to bind DNA and the remaining residues are assumed not to bind. We standardize the thresholds across predictors to allow for reliable side-by-side comparisons. In particular, we use several thresholds to compare results for diverse predictive scenarios including low FPRs at 0.1 and 0.2 (given the low fraction of DBRs), and high sensitivities of 0.5 and 0.7.
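The threshold standardization described above can be illustrated with a short sketch. It assumes the standard textbook definitions of sensitivity, specificity and F1, and it calibrates the threshold by scanning the observed propensities for the lowest value whose FPR does not exceed the target; the function names and the scanning strategy are illustrative assumptions rather than the authors' implementation.

```python
def confusion(scores, labels, threshold):
    """Count TP, TN, FP, FN; residues with propensity > threshold are predicted as DBRs."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 0)
    return tp, tn, fp, fn


def binary_metrics(tp, tn, fp, fn):
    """Standard definitions (assumed): sensitivity = TPR, FPR = 1 - specificity."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "fpr": 1.0 - specificity, "f1": f1}


def threshold_at_fpr(scores, labels, target_fpr):
    """Lowest observed propensity that keeps the FPR at or below target_fpr."""
    for t in sorted(set(scores)):  # the FPR can only decrease as the threshold grows
        tp, tn, fp, fn = confusion(scores, labels, t)
        if fp + tn == 0 or fp / (fp + tn) <= target_fpr:
            return t
    return max(scores)


# Toy example: 2 DBRs among 10 residues, calibrated to FPR <= 0.2.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
t = threshold_at_fpr(scores, labels, 0.2)
result = binary_metrics(*confusion(scores, labels, t))
```

Calibrating every predictor to the same FPR (or the same sensitivity) means the resulting sensitivities (or specificities) can be compared directly, irrespective of how each tool scales its propensities.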
Consistent with the assessments in several related studies (10,11,49), we evaluate false positives in two distinct categories: cross-predictions, which occur for residues that interact with the non-DNA ligands, and over-predictions, which we measure for residues that are not annotated to bind any ligands. Correspondingly, we compute two metrics: the cross-prediction rate, $\mathrm{CPR} = FP_{\text{non-DNA}} / N_{\text{non-DNA}}$, which quantifies the fraction of residues that bind non-DNA ligands that are predicted as DBRs, and the over-prediction rate, $\mathrm{OPR} = FP_{\text{non-binding}} / N_{\text{non-binding}}$, which is the fraction of non-binding residues predicted as DBRs among all non-binding residues. Similar to AULCratio, to ease interpretation of these values we report CPRratio and OPRratio that are computed as the CPR and OPR of a random predictor divided by the CPR and OPR of the evaluated method, respectively. This way, the values of the two ratios quantify the rate of improvement over the random result. We also assess the propensities using the area under the cross-prediction curve (AUCPC) and the area under the over-prediction curve (AUOPC), which analyze the relation between CPR and TPR, and between OPR and TPR, respectively. Importantly, higher AUOPC and AUCPC values mean that the amount of the over-predictions and cross-predictions is higher (worse).
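The two false-positive rates and their ratios can be sketched as below. The residue labels (`"dna"`, `"other"`, `"none"`) and the function name are hypothetical. The random baseline is an assumption: a random predictor that flags the same overall fraction of residues as the evaluated method is expected to have CPR and OPR equal to that fraction.

```python
def cross_and_over_prediction(predicted, annotation):
    """Compute CPR, OPR and their ratios against a random baseline.

    predicted:  list of 0/1 binary DBR predictions, one per residue.
    annotation: list of 'dna', 'other' (binds a non-DNA ligand) or 'none'
                (binds no ligand); these label names are illustrative.
    """
    n_other = sum(1 for a in annotation if a == "other")
    n_none = sum(1 for a in annotation if a == "none")
    fp_other = sum(1 for p, a in zip(predicted, annotation) if p and a == "other")
    fp_none = sum(1 for p, a in zip(predicted, annotation) if p and a == "none")

    cpr = fp_other / n_other if n_other else 0.0
    opr = fp_none / n_none if n_none else 0.0

    # Assumed random baseline: the overall fraction of residues predicted as DBRs.
    random_rate = sum(predicted) / len(predicted)
    cpr_ratio = random_rate / cpr if cpr else float("inf")
    opr_ratio = random_rate / opr if opr else float("inf")
    return {"CPR": cpr, "OPR": opr, "CPRratio": cpr_ratio, "OPRratio": opr_ratio}


predicted = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotation = ["dna", "dna", "other", "other", "other", "other",
              "none", "none", "none", "none"]
rates = cross_and_over_prediction(predicted, annotation)
```

Because each ratio divides the random rate by the method's rate, values above 1 indicate fewer cross-predictions or over-predictions than expected by chance.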
Lastly, we quantify statistical significance of differences between results produced by different predictors. This analysis finds whether one method provides consistently better results when compared with another tool over a broad range of different datasets. We perform 100 random selections of 20 DNA-binding and 40 non-DNA-binding proteins, with an equal split of the structure- and disorder-annotated proteins, from the benchmark dataset. We evaluate statistical significance of differences over these 100 paired results using the Student's t-test if the measurements are normal based on the Anderson-Darling test at 0.05 significance (89); otherwise we apply the Wilcoxon rank-sum test. We assume that the difference is significant if the resulting P-value < 0.01. This is consistent with recent related works (10,49,58,59,77).
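The resampling step can be sketched as follows. The sketch assumes that proteins are drawn without replacement within each selection and that "equal split" means half of each group is structure-annotated and half is disorder-annotated; the data layout, the fixed seed and the function name are hypothetical.

```python
import random


def sample_subsets(proteins, n_subsets=100, n_dna=20, n_other=40, seed=0):
    """Draw balanced random subsets of the benchmark proteins.

    proteins: dict mapping protein_id -> (binds_dna, annotation), where
              binds_dna is a bool and annotation is 'structure' or 'disorder'
              (an illustrative layout, not a published format).
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    groups = {}
    for protein_id, key in proteins.items():
        groups.setdefault(key, []).append(protein_id)

    subsets = []
    for _ in range(n_subsets):
        subset = []
        for binds_dna, total in ((True, n_dna), (False, n_other)):
            for annotation in ("structure", "disorder"):
                pool = sorted(groups[(binds_dna, annotation)])
                # Half of each group comes from each annotation type.
                subset.extend(rng.sample(pool, total // 2))
        subsets.append(subset)
    return subsets
```

Each metric is then computed per subset for every predictor, which yields the 100 paired measurements that enter the normality check and the subsequent significance test.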

Architecture of the hybridDBRpred
Motivated by our empirical results that reveal that none of the current tools predicts accurately across the disorder-annotated and structure-annotated DNA-binding proteins, we design an innovative meta-predictor with the objective to significantly improve predictive performance. This meta-method utilizes a deep neural network to combine results generated by three complementary predictors of DBRs: one disorder-trained method (DisoRDPbind) and two structure-trained methods (DNAPred and DNAgenie). These methods produce accurate results for different proteins and different sequence regions (structured vs. disordered), and so an effective way to combine their results requires identifying these differences using sequence-derived information. Consequently, we utilize three groups of inputs: (i) amino acid-level predictions of DBRs; (ii) amino acid-level hallmarks of DBRs that can be derived from the sequence (11), such as polarizability, charge, hydrophilicity, propensity for intrinsic disorder (90), solvent accessibility that we predict with the quick and accurate ASAquick (91), and putative intrinsic disorder that we generate using the popular and fast IUPred3; and (iii) aggregate features that target detection of IDRs by calculating propensity for disorder for sequence segments. We detail these inputs in Supplementary Table S2. Altogether, we introduce four innovations to generate accurate meta-predictions. In particular, we (i) design feature group 3 that facilitates detection of IDRs, since disorder-trained methods are biased to perform better for the disorder-annotated proteins while structure-trained methods tend to be more accurate for the structure-annotated proteins; (ii) use a sliding window to present the amino acid-level features in groups 1 and 2 to the model, which provides useful context for the selection of the best input prediction of DBRs; (iii) utilize modern transformer modules to implement the deep neural network; and (iv) train the transformer network using the binary cross-entropy loss function.
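The sliding-window presentation of the amino acid-level features can be illustrated with a minimal sketch. It assumes a window centred on the residue of interest and zero padding at the sequence termini; neither detail is stated in the text, and the function name is hypothetical. The default window of 15 residues follows the architecture description.

```python
def window_features(per_residue, window=15, pad_value=0.0):
    """Flatten the features of neighbouring residues into one vector per residue.

    per_residue: list of equal-length feature lists, one list per residue.
    Returns one list of window * n_features values for every residue.
    Centring and zero padding are assumptions made for this sketch.
    """
    if not per_residue:
        return []
    half = window // 2
    n_features = len(per_residue[0])
    padding = [pad_value] * n_features
    rows = []
    for i in range(len(per_residue)):
        row = []
        for j in range(i - half, i + half + 1):
            # Positions outside the sequence contribute padding values.
            row.extend(per_residue[j] if 0 <= j < len(per_residue) else padding)
        rows.append(row)
    return rows


# Toy profile: 20 residues with 10 features each.
features = [[float(i + 1)] * 10 for i in range(20)]
rows = window_features(features)
```

With 10 features per residue and a window of 15, each residue is described by 150 values that capture its local sequence context.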
We summarize the architecture of our deep meta-predictor in Figure 1. First, we convert the protein sequence into the sequence profile. This profile includes the input groups 1 and 2, which total 10 features that we process using a sliding window of size 15, and which we combine with 20 features from the input group 3 (light green box in Figure 1). Next, we feed the sequence profile into a deep transformer network (92) that consists of three stacked transformer modules (light yellow block in Figure 1). Each transformer includes a self-attention unit connected to a feed-forward layer that is followed by a normalization layer before feeding into the subsequent transformer. We pass the normalized output of the last transformer to the fully connected feed-forward network that we use to reduce the multidimensional latent space produced by the transformers into the predicted DNA-binding propensity (light blue box in Figure 1). The feed-forward network gradually reduces the latent space from 20, to 10, to 5 and eventually to one neuron, which outputs the binding propensity. We train this architecture using PyTorch with the popular Adam optimizer and the binary cross-entropy loss function. We set the learning rate and batch size to 0.0001 and 128, respectively. We apply the binary cross-entropy loss function (93) instead of the default mean absolute error (L1), which is motivated by the use of the former function in several recent related studies (94–96). The binary cross-entropy loss function maximizes the likelihood of making correct predictions, penalizes incorrect predictions (large differences from the ground truth) more substantially than near-correct predictions, and converges faster.
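The difference in how the two loss functions penalize errors can be seen in a small numeric sketch. It uses the standard per-residue definitions of binary cross-entropy and absolute error; the function names and the clipping constant `eps` are illustrative choices.

```python
import math


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one predicted propensity p and a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def l1(p, y):
    """Absolute error (L1) for one prediction."""
    return abs(p - y)


# A true DBR (y = 1) predicted near-correctly (0.9) and incorrectly (0.1).
near_correct_bce, incorrect_bce = bce(0.9, 1), bce(0.1, 1)
near_correct_l1, incorrect_l1 = l1(0.9, 1), l1(0.1, 1)
bce_penalty_growth = incorrect_bce / near_correct_bce
l1_penalty_growth = incorrect_l1 / near_correct_l1
```

In this example the L1 penalty grows about 9-fold between the near-correct and the incorrect prediction, while the cross-entropy penalty grows about 22-fold, which reflects the stronger penalization of large deviations from the ground truth.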

Comparative assessment of predictive performance
We primarily focus on investigating the ability of disorder-trained methods to make accurate predictions for the structure-annotated proteins and vice versa. The results are summarized in Table 2, with the corresponding ROC curves in Supplementary Figure S1A (entire benchmark dataset), S1D (disorder-annotated proteins in the benchmark dataset) and S1G (structure-annotated proteins in the benchmark dataset).
For the structure-annotated proteins, Table 2 shows that four of the five structure-trained methods produce predictions with AUC > 0.74 and that DNAPred achieves the highest AUC = 0.81. These are relatively accurate predictions, as suggested by the AULCratio values that range between 3.5 and 7.3 for these four tools. On the other hand, the disorder-trained methods provide low-quality results, with AULCratios ranging between 0.07 and 1.9, and DisoRDPbind producing the highest AUC of 0.62. We observe the same trend when using binary metrics. For instance, the four structure-trained methods obtain sensitivity values at the 0.1 FPR between 0.32 and 0.51, while the best disorder-trained method, DisoRDPbind, has sensitivity = 0.19. While the poor performance of ANCHOR2, fMoRFpred and MoRFCHiBi can be attributed to the fact that they were trained to predict disordered residues that bind to proteins and peptides, DisoRDPbind and DeepDISOBind target prediction of DBRs and still perform rather poorly. This likely stems from the fact that their predictive models, which are trained on the disorder-annotated proteins, do not generalize to the structure-annotated protein-DNA interactions.
For the disorder-annotated proteins, we find that the two disorder-trained predictors of DBRs outperform most of the structure-trained methods, securing AUCs of 0.64 (DeepDISOBind) and 0.63 (DisoRDPbind). The one exception is the structure-trained DNAgenie that has AUC of 0.68 and AULCratio of 3.8, and outperforms the disorder-trained methods. DNAgenie is a recently published tool that utilizes a training dataset of DNA-binding proteins collected from PDB, which are processed to map data from multiple protein-DNA complexes onto the same protein, resulting in a more complete set of binding annotations. It also uses disorder predictions as an input, which facilitates identifying putative disordered binding residues that undergo disorder-to-order transitions upon binding DNA (97). These factors can explain DNAgenie's ability to produce good results for the disorder-annotated proteins. Overall, we find that while the disorder-trained methods perform relatively well for the disorder-annotated proteins, the best results are secured by DNAgenie. These observations partly agree with the related recent studies of predictions of protein-binding and RNA-binding residues (77,78). While we similarly show that disorder-trained methods are outperformed by the structure-trained tools for the structure-annotated proteins, we find that the structure-trained DNAgenie performs favorably on the disorder-annotated proteins. However, DNAgenie and the disorder-trained methods are outperformed by the structure-trained DNAPred on the structure-annotated proteins. This means that a single predictor does not provide consistently best results for both structure-annotated and disorder-annotated proteins.
Table 2 notes. We report averages and the corresponding standard deviations over the 100 subsets (see 'Assessment metrics and statistical analysis' section for details). We provide sensitivity that is calibrated for all methods to the same FPR = 0.1 and 0.2, and specificity calibrated to the sensitivity = TPR = 0.5 and 0.7. This allows for a direct comparison between methods under several diverse predictive scenarios. The best results for a given dataset and for each column are in bold font. We report results from the statistical significance test using superscript in the 'x / y' format, where x indicates comparison against the current method with the highest AUC and y stands for the comparison against the new hybridDBRpred meta-predictor; +, = and - denote that the best current predictor or hybridDBRpred is significantly better, not significantly different, or significantly worse than another method, respectively, at P-value < 0.01.
The evaluation on the full benchmark dataset provides further details. We find that the best results are produced by the structure-trained DNAgenie (AUC of 0.70 and AULCratio of 3.6) and DNAPred (AUC of 0.66 and AULCratio of 4.5). Their predictions are statistically significantly better than the results of the other eight methods (P-value < 0.01). The best disorder-trained method, DisoRDPbind, obtains AUC = 0.63 and AULCratio = 2.3. The corresponding ROC curves are in Supplementary Figure S1A. DNAgenie performs modestly well on the structure-annotated proteins (third-best AUC) and very well on the disorder-annotated proteins (best AUC, statistically better than all other methods at P-value < 0.01). DNAPred is the best option for the structure-annotated proteins (best AUC, statistically better than all other methods at P-value < 0.01) but performs poorly for the disorder-annotated proteins. DisoRDPbind has the third-best AUC for the disorder-annotated proteins and performs by far the best among the disorder-trained tools for the structure-annotated proteins. The low-quality results produced by the disorder-trained ANCHOR2, fMoRFpred and MoRFCHiBi are due to the fact that these tools predict disordered protein- and peptide-binding residues. The modest performance of the structure-trained TargetS likely comes from the broad scope of this model, which predicts 12 different types of ligands.
To summarize, we find that none of the current tools offers results that are the most accurate across the two types of annotations. The best results for the structure-annotated proteins are offered by the structure-trained DNAPred and for the disorder-annotated proteins by the structure-trained DNAgenie, while DisoRDPbind is the best disorder-trained tool.

Analysis of the cross-predictions and over-predictions
Recent studies demonstrate that some sequence-based predictors of binding residues suffer high cross-prediction rates, which means that they essentially predict binding residues in a ligand-agnostic manner (49,53,56,58,77,79,80,85). Table 3 analyses the false positives generated by the ten predictors to quantify the cross-prediction errors (residues that bind non-DNA ligands predicted as DBRs) and over-prediction errors (non-binding residues predicted as DBRs). The corresponding cross-prediction curves and over-prediction curves are in Supplementary Figure S1. The AUCPC (area under the cross-prediction curve) and AUOPC (area under the over-prediction curve) of 0.5 correspond to random levels of performance while lower values indicate lower amounts of the cross- and over-predictions. On the other hand, CPRratio and OPRratio quantify the rate of improvement over a random predictor, where higher values denote more accurate results. TargetS, fMoRFpred, ANCHOR2 and MoRFCHiBi produce high amounts of cross-predictions and over-predictions with AUCPC > 0.4 and/or AUOPC > 0.4. This can be explained by the fact that TargetS was designed to predict interactions with multiple ligand types and that fMoRFpred, ANCHOR2 and MoRFCHiBi predict disordered residues that bind proteins and peptides. This means that the cross-predictions are expected for these methods. We focus our analysis on the other six predictors.
For the structure-annotated proteins, DeepDISOBind performs poorly with AUCPC and AUOPC > 0.4, which means that it substantially over-predicts DBRs. The remaining methods perform relatively well with AUOPC on average at about 0.25 and AUCPC on average at about 0.33. The best structure-trained method, DNAPred, secures AUCPC = 0.276 and AUOPC = 0.180, demonstrating a good ability to selectively and accurately predict DBRs for the structure-annotated proteins. The same trends are reflected by the CPRratio and OPRratio scores, where DNAPred obtains the best results and high values of 3.0 and 5.3, respectively. The structure-trained methods (TargetDNA, BindN+, DNAPred, and DNAgenie) perform better than the disorder-trained DisoRDPbind, which is expected for this protein set. Moreover, the cross-prediction rates are higher than the over-prediction rates (i.e. OPRratios > CPRratios), which suggests that these methods are biased to cross-predict between the ligand types.
For the disorder-annotated proteins, the structure-trained TargetDNA, BindN+, and DNAPred perform poorly with AUCPC and AUOPC > 0.4. The disorder-trained DeepDISOBind also makes substantial amounts of cross-predictions (AUCPC = 0.45). The only two well-performing methods are the structure-trained DNAgenie and the disorder-trained DisoRDPbind. They obtain AUCPC and AUOPC at around 0.32 (DNAgenie) and 0.37 (DisoRDPbind), and are the only tools with CPRratio and OPRratio > 2.
Using the full benchmark dataset, we find that DeepDISOBind and TargetDNA suffer high rates of cross-predictions and over-predictions. Supplementary Figures S1B and S1C plot the corresponding cross-prediction and over-prediction curves. DNAgenie is the best tool, securing AUCPC = 0.29, AUOPC = 0.30, CPRratio = 3.0 and OPRratio = 3.3, which suggests that it is at least three times better than a random predictor. These results are also statistically better than the results of the other nine methods (P-value < 0.01). DNAPred, BindN+, and DisoRDPbind perform reasonably well, with AUCPC and AUOPC ≤ 0.4 and CPRratio and OPRratio at around or over 2.
To sum up, the structure-trained DNAgenie produces overall the best, low and balanced amounts of cross-predictions and over-predictions. The structure-trained DNAPred is better than DNAgenie for the structure-annotated proteins but makes excessively large amounts of cross-predictions for the disorder-annotated proteins. The best disorder-trained method, DisoRDPbind, generates modest amounts of both cross- and over-predictions that are similar across the structure- and disorder-annotated proteins. The other tools make large amounts of errors, particularly in terms of the cross-predictions for the disorder-annotated proteins. Their AUCPC values are around 0.5, which suggests that they effectively predict all binding residues, irrespective of the ligand type.

Comparative assessment of hybridDBRpred's predictive performance
Table 3. The evaluation of cross-predictions and over-predictions for the 10 structure-trained and disorder-trained predictors of binding residues and the new hybridDBRpred meta-predictor using the sampled test dataset. We report averages and the corresponding standard deviations over the 100 subsets (see 'Assessment metrics and statistical analysis' section for details). The best results for a given dataset and for each column are shown in bold font. We report results from the statistical significance test using superscript in the 'x/y' format, where x indicates comparison against the current method with the highest AUC and y stands for the comparison against the new hybridDBRpred meta-predictor; +, =, and - denote that the best current predictor or hybridDBRpred is significantly better than, not significantly different from, or significantly worse than another method at P-value < 0.01.

Our analysis reveals that none of the ten methods provides the most accurate results across both structure- and disorder-annotated proteins. Some methods perform well on the disorder-annotated proteins (DNAgenie and DisoRDPbind) or the structure-annotated proteins (DNAPred, TargetDNA, and BindN+). Moreover, some methods suffer relatively high cross-prediction rates. These findings are in line with the results of recent works that investigated predictors of protein-binding and RNA-binding residues and developed new meta-predictors that overcome their limitations (77,78). Motivated by these studies and the trade-offs that we uncovered, we investigate whether a meta-predictor that combines well-performing structure-trained and intrinsic disorder-trained predictors of DBRs would provide significant improvements in the predictive performance. Our hybridDBRpred meta-predictor relies on a modern deep transformer network and sequence-derived inputs that provide useful context to accurately combine predictions from the three arguably best current tools: DisoRDPbind, DNAPred and DNAgenie. Table 2 compares hybridDBRpred against the ten predictors. We find that hybridDBRpred produces the best predictions on the full benchmark dataset, with AUC = 0.786, AULCratio = 5.19, F1 = 0.26 and sensitivity = 0.43 at 0.1 FPR. These results are statistically significantly higher than those of the current methods (P-value ≤ 0.01). To compare, the best scores generated by the current methods are AUC = 0.703 (DNAgenie), AULCratio = 4.50 (DNAPred), F1 = 0.22 (DNAPred), and sensitivity = 0.33 at 0.1 FPR (DNAPred). More importantly, hybridDBRpred generates accurate predictions for both the structure-annotated and the disorder-annotated proteins. It obtains the highest AUC = 0.827 for the structure-annotated proteins and also the highest AUC = 0.766 for the disorder-annotated proteins. Moreover, hybridDBRpred produces low amounts of the cross-prediction and over-prediction errors. Table 3 reveals that hybridDBRpred's AUCPC, which quantifies the level of cross-predictions, is 0.201 for the entire test dataset, 0.210 for the structure-annotated test proteins, and 0.237 for the disorder-annotated test proteins, compared to the best (lowest) AUCPC values of the current tools, which are 0.287 (DNAgenie), 0.276 (DNAPred), and 0.323 (DNAgenie), respectively. Similarly, hybridDBRpred's over-predictions, which we quantify with AUOPC, are 0.216 on the full test set, 0.172 for the structure-annotated proteins, and 0.234 for the disorder-annotated proteins. These are all better (lower) than the results of the existing tools, which secure the best AUOPCs of 0.298 (DNAgenie), 0.180 (DNAPred), and 0.316 (DNAgenie), respectively. The improvements in AUCPC and AUOPC values when contrasting hybridDBRpred against each of the ten methods are statistically significant (P-value ≤ 0.01). In short, the new deep learning-based hybridDBRpred meta-predictor substantially improves over the ten current tools, provides balanced quality of predictions for the disorder-annotated and structure-annotated proteins, and generates low levels of cross-predictions.
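To make the "sensitivity at 0.1 FPR" metric quoted above concrete, the following is a minimal Python sketch of how such a value could be computed from per-residue labels and propensities. The function name `sensitivity_at_fpr` and the toy data are illustrative assumptions; the sketch does not reproduce the sampling over 100 subsets used for the reported numbers.

```python
def sensitivity_at_fpr(labels, scores, target_fpr=0.1):
    """Highest sensitivity (TPR) reachable while the FPR stays <= target_fpr.

    labels: 1 for native DNA-binding residues, 0 otherwise.
    scores: predicted propensities (higher means more likely to bind DNA).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both binding and non-binding residues")

    best_tpr = 0.0
    # Lower the threshold step by step; TP and FP counts can only grow,
    # so we can stop as soon as the FPR budget is exceeded.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        if fp / negatives > target_fpr:
            break
        best_tpr = tp / positives
    return best_tpr


# Toy example (hypothetical values): 2 binding and 10 non-binding residues.
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.25, 0.8, 0.3, 0.2, 0.15, 0.1, 0.1, 0.1, 0.05, 0.05, 0.05]
toy_sensitivity = sensitivity_at_fpr(toy_labels, toy_scores, target_fpr=0.1)
```

In this toy case at most one false positive is allowed (10 negatives at 0.1 FPR), which lets only the top-scoring binding residue be recovered.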
We also compare hybridDBRpred with two baseline meta-predictors. The baselines include a logistic regression model that utilizes the same inputs as hybridDBRpred and a simple meta-predictor that computes the average of the propensities produced by the three tools that we use as hybridDBRpred's inputs. We detail these baselines in the supplement. Table 2 shows that the baselines secure similar levels of predictive performance, with the AUC of 0.720 (average-based) and 0.726 (regression) on the entire test set, compared to the statistically better AUC of 0.786 from hybridDBRpred (P-value < 0.01). Our deep transformer-based solution also provides statistically significant improvements in AUC for both the structure-annotated and the disorder-annotated test proteins (P-value < 0.01; Table 2). Furthermore, hybridDBRpred outperforms both baselines in the context of the cross-predictions and over-predictions. It obtains AUCPC of 0.201 vs. 0.293 and 0.315 for the baselines, and AUOPC of 0.216 vs. 0.271 and 0.272 for the baselines; these differences are statistically significant (P-value < 0.01; Table 3). These results reveal that the deep transformer-based model provides large improvements over simpler meta-predictors. Next, we investigate contributions of specific elements of our model to these improvements.
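The average-based baseline described above can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, and it assumes that the three input propensities are already on a comparable scale (e.g. within [0, 1]); any normalization described in the supplement is not reproduced here.

```python
def average_meta_predictor(disordpbind, dnapred, dnagenie):
    """Per-residue average of the propensities from the three input tools.

    Each argument is a list with one propensity per residue of the same
    protein (assumed here to share a common scale).
    """
    if not (len(disordpbind) == len(dnapred) == len(dnagenie)):
        raise ValueError("all three predictions must cover the same residues")
    return [
        (a + b + c) / 3.0
        for a, b, c in zip(disordpbind, dnapred, dnagenie)
    ]


# Toy example (hypothetical propensities for a two-residue protein).
toy_average = average_meta_predictor([0.3, 0.9], [0.6, 0.9], [0.0, 0.3])
```

Such a baseline has no trainable parameters, which makes it a useful lower bound when judging what the logistic regression and the transformer-based model add.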

Ablation analysis
The hybridDBRpred meta-predictor relies on four key innovations: calculation of aggregate features for detection of IDRs; application of the sliding window to collect inputs; use of the transformer modules to design the deep network; and training with the binary cross-entropy loss function. We perform ablation analysis by removing one of these innovations at a time and measuring the difference in the predictive performance when compared to the complete hybridDBRpred model. Supplementary Figure S2 summarizes these results on the test dataset by considering the overall predictive performance measured with AUC and the amount of cross-predictions and over-predictions quantified with AUCPC and AUOPC, respectively. We find that the removal of each of the four innovations leads to a statistically significant drop in predictive quality measured with each of the three metrics (P-values < 0.01). The features for the detection of IDRs contribute the most to the reduction of cross-predictions (Supplementary Figure S2B). The use of the binary cross-entropy loss function helps with improving both cross-predictions and over-predictions (Supplementary Figure S2B and C). The inclusion of the transformer modules and the sliding window has a substantial impact on the overall predictive quality and also reduces the cross- and over-predictions (Supplementary Figure S2A-C). Altogether, these results suggest that each of the four innovations contributes to the favorable predictive performance of our deep meta-predictor.
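As an illustration of the sliding-window idea mentioned above, the Python sketch below collects, for every residue, the propensities of its sequence neighbours. The window half-width, the zero padding at the termini, and the function name are assumptions made for this example; this section does not state the window size or padding used by hybridDBRpred.

```python
def sliding_window_inputs(propensities, half_width=1, pad_value=0.0):
    """Return one fixed-length window of propensities per residue.

    The window is centred on the residue and spans `half_width` positions
    on each side; positions beyond the sequence termini are filled with
    `pad_value` (an assumption of this sketch).
    """
    if half_width < 0:
        raise ValueError("half_width must be non-negative")
    padded = (
        [pad_value] * half_width + list(propensities) + [pad_value] * half_width
    )
    window_size = 2 * half_width + 1
    return [padded[i:i + window_size] for i in range(len(propensities))]


# Toy example: three residues, window of size 3 (hypothetical values).
toy_windows = sliding_window_inputs([0.1, 0.2, 0.3], half_width=1)
```

Each window gives the downstream model local sequence context around a residue rather than a single isolated score.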
We also investigate contributions of each of the three input predictors of DBRs to our meta-predictor. Supplementary Figure S3 compares results of the complete hybridDBRpred model with the three versions of that model where one of the input predictions is removed. This ablation experiment demonstrates that the removal of any of the three inputs produces a statistically significant drop in the predictive quality measured with AUC, AUCPC and AUOPC (P-values < 0.01). In particular, the AUC decreases from 0.786 (the complete hybridDBRpred model) to 0.754, 0.748, and 0.737 when DisoRDPbind, DNAPred and DNAgenie are excluded, respectively. While these values are still better than the AUCs of 0.726 and 0.720 for the baseline meta-predictors (Table 2), which is due to the use of the above-mentioned innovations, they demonstrate that the inclusion of each of the three predictions in hybridDBRpred is warranted.

HybridDBRpred web server
The hybridDBRpred method is freely available as a convenient web server at http://biomine.cs.vcu.edu/servers/hybridDBRpred/. It requires only the FASTA-formatted protein sequence as input. It automates the entire prediction process on the server side by running predictions by DNAPred, DisoRDPbind and DNAgenie, generating the inputs to the deep learner, and processing predictions using the transformer network. The server takes about 2 min to predict an average-size sequence with about 200 residues. Upon completion of the prediction, users can browse the color-coded prediction results on the webpage and receive text-formatted results at the email address that they (optionally) provide. The outputs produced by the server include the putative propensities from DNAPred, DNAgenie, and DisoRDPbind, together with the predictions of hybridDBRpred. We archive the results for at least three months.

Analysis of the putative DNA-binding residues
Native annotations of DBRs, particularly for the structure-annotated proteins, rely on somewhat subjective protocols. Most of the methods assume that a given residue binds DNA if at least one of its atoms is close enough to one of the DNA's atoms, using a few different distance thresholds including 3.5 Å (25,34,37,49) and 4.5 Å (36). BioLiP, which was used to annotate data for DNAgenie, applies a more sophisticated approach where the distance threshold is computed as 0.5 Å plus the sum of the van der Waals radii of the closest protein atom and DNA atom (87). These differences may result in different annotations of native DBRs for the same protein.
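The effect of these annotation protocols can be illustrated with a short Python sketch that labels one residue under a fixed distance cutoff and under a van der Waals-based rule. The atom representation, function names, the use of `<=` in the comparisons, and the Bondi radii in `VDW_RADII` are assumptions made for illustration; the actual radii and implementation details used by BioLiP or by the individual predictors may differ.

```python
import math

# Bondi van der Waals radii in angstroms (assumed here for illustration).
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "P": 1.80, "S": 1.80}


def binds_fixed_cutoff(residue_atoms, dna_atoms, cutoff):
    """True if any residue atom is within `cutoff` angstroms of any DNA atom.

    Atoms are (element, x, y, z) tuples with coordinates in angstroms.
    """
    return any(
        math.dist(p[1:], d[1:]) <= cutoff
        for p in residue_atoms
        for d in dna_atoms
    )


def binds_vdw_rule(residue_atoms, dna_atoms, margin=0.5):
    """True if any atom pair is closer than the sum of their radii plus `margin`."""
    return any(
        math.dist(p[1:], d[1:]) <= VDW_RADII[p[0]] + VDW_RADII[d[0]] + margin
        for p in residue_atoms
        for d in dna_atoms
    )


# Toy example: a nitrogen atom 3.55 angstroms away from a DNA oxygen atom.
toy_residue = [("N", 0.0, 0.0, 0.0)]
toy_dna = [("O", 3.55, 0.0, 0.0)]
label_35 = binds_fixed_cutoff(toy_residue, toy_dna, cutoff=3.5)   # not binding
label_45 = binds_fixed_cutoff(toy_residue, toy_dna, cutoff=4.5)   # binding
label_vdw = binds_vdw_rule(toy_residue, toy_dna)                  # 1.55 + 1.52 + 0.5 = 3.57
```

The same residue is annotated as non-binding under the 3.5 Å cutoff and as binding under the other two rules, which is the kind of discrepancy discussed above.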
We study whether predictions from the best performing methods, including DNAPred, DNAgenie, DisoRDPbind, the two baseline meta-predictors, and hybridDBRpred, are sensitive to these differences by investigating whether the false positives (incorrectly predicted DBRs) are biased to localize close to the native DBRs. Figure 2 analyses the presence of putative DBRs near the native DBRs in the sequence; we cannot perform this analysis using proximity in the structure since some annotations concern disordered regions. The x-axis quantifies the number of positions between the residues that we analyze and the nearest native DBRs, while the y-axis gives TPR values when assuming that the putative DBRs within the distance defined by the x-axis are correct. In other words, we count predicted DBRs that are within x = {1, 2, 3, 4, 5} positions from the native DBRs as true positives (solid lines in Figure 2). We compare these results against baselines where DNA-binding residues are predicted at random in the same proportions as the putative binding residues generated by the considered predictors (dotted lines in Figure 2 that are color-coded for the corresponding predictors). We find that disproportionally higher numbers of putative DBRs are located immediately adjacent to a native DBR. This can be measured by comparing the increase in TPRs between x = 0 and x = 1 and the subsequent positions with the corresponding baseline results. For instance, TPR of hybridDBRpred grows from 0.49 to 0.59 ((0.59 - 0.49)/0.49 = 20% increase) between x of 0 and 1, and further to 0.66 between x of 1 and 2 ((0.66 - 0.59)/0.59 = 12% increase). The corresponding baseline grows from 0.49 for x = 0 to 0.53 for x = 1 (8% increase), and to 0.56 for x = 2 (6% increase). For the predictions, the growth is higher at lower values of x and substantially larger than for the baselines. These observations are true for all other methods (Figure 2), e.g. increases of 0.12 (x = 0 to x = 1) vs. 0.06 (x = 1 to x = 3) for DNAPred; 0.09 vs. 0.04 for DNAgenie; 0.12 vs. 0.06 for the average-based baseline meta-predictor; they are all coupled with substantially lower values for the baselines. Overall, Figure 2 shows diminishing slopes for each of the six curves, which reveals that false positives are concentrated around the positions of the native DBRs. This implies that DBRs predicted for the amino acids adjacent to the native DBRs in the sequence could be driven by the threshold-dependent nature of annotations, and perhaps should not be treated as mistakes. That suggests that the predictive performance that we calculate for these tools might underestimate their actual performance. These findings agree with recent studies that similarly identify an increase in the 'false positives' near the positions of native protein- and nucleic acid-binding residues (49,85).

Figure 2. TPR values (y-axis) computed as a function of the number of positions in the sequence between the evaluated residues and the nearest native DBRs (x-axis). We consider the four best performing methods from Table 2 and perform this test on the test dataset. We compute the TPR values by assuming that putative DBRs that are within a given number of positions away from the native DBR are correct. Solid lines report results based on the predictions from the four methods while the color-coded dotted lines represent corresponding baselines where DNA-binding residues are predicted at random in the same proportions as the putative binding residues generated by the predictors.
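The proximity analysis described above can be sketched in Python as follows. This is one plausible reading of the protocol, not the authors' implementation: here a native DBR counts as recovered when a putative DBR lies within `tolerance` sequence positions of it, and the random baseline places the same number of putative DBRs uniformly at random. All function names and the toy data are hypothetical.

```python
import random


def tolerant_tpr(native, predicted, tolerance):
    """Fraction of native DBRs with a putative DBR within `tolerance` positions.

    native, predicted: lists of 0/1 flags, one per residue.
    """
    native_pos = [i for i, flag in enumerate(native) if flag == 1]
    predicted_pos = [i for i, flag in enumerate(predicted) if flag == 1]
    if not native_pos:
        raise ValueError("no native DNA-binding residues")
    recovered = sum(
        1
        for i in native_pos
        if any(abs(i - j) <= tolerance for j in predicted_pos)
    )
    return recovered / len(native_pos)


def random_baseline_tpr(native, n_predicted, tolerance, trials=1000, seed=0):
    """Average tolerant TPR when n_predicted residues are picked at random."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        chosen = set(rng.sample(range(len(native)), n_predicted))
        random_pred = [1 if i in chosen else 0 for i in range(len(native))]
        total += tolerant_tpr(native, random_pred, tolerance)
    return total / trials


def relative_increase(before, after):
    """Relative growth, e.g. of the TPR between two consecutive x values."""
    return (after - before) / before


# Toy example (hypothetical 10-residue protein).
toy_native = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
toy_predicted = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
tpr_x0 = tolerant_tpr(toy_native, toy_predicted, tolerance=0)
tpr_x1 = tolerant_tpr(toy_native, toy_predicted, tolerance=1)
baseline_x1 = random_baseline_tpr(toy_native, n_predicted=2, tolerance=1)

# Relative growth for the hybridDBRpred values quoted in the text.
growth_0_to_1 = relative_increase(0.49, 0.59)  # about 0.20
growth_1_to_2 = relative_increase(0.59, 0.66)  # about 0.12
```

Comparing `tpr_x1 - tpr_x0` with the corresponding change in the random baseline mirrors the comparison between the solid and dotted lines described in the text.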

Summary and conclusions
Current sequence-based predictors of DBRs belong to two distinct groups, those trained on the structure-annotated proteins vs. the disorder-annotated proteins. We identify and summarize a comprehensive collection of 34 predictors. We select a representative set of 10 predictors, which include 7 predictors of DBRs and 3 predictors of disordered binding regions. We use them to perform a first-of-its-kind empirical analysis of their ability to accurately predict DBRs using a novel and low-similarity benchmark dataset composed of the structure-annotated and the disorder-annotated proteins. The most accurate predictions for the structure-annotated proteins are offered by the structure-trained predictors, including the best DNAPred, while the disorder-trained methods perform poorly for these proteins. Moreover, the structure-trained DNAgenie performs well for the disorder-annotated proteins and DisoRDPbind is the best disorder-trained tool. These observations complement results of recent studies that focus on the evaluation of the predictions of protein-binding and RNA-binding residues (77,78). Analysis of false positives reveals that they are disproportionally concentrated in the vicinity of the native DBRs. This likely stems from a somewhat arbitrary nature of the native annotations of DBRs and suggests that we could be underestimating the actual predictive performance. Moreover, we suggest that more accurate disorder-trained tools are needed due to modest levels of predictive performance of the current tools. We also study the cross-predictions, where residues that bind other (non-DNA) ligand types are predicted as DBRs. Except for DNAgenie and DisoRDPbind, the other considered methods make excessive amounts of cross-predictions for the disorder-annotated proteins, effectively making ligand-agnostic predictions of all binding residues. Furthermore, we find that TargetDNA, BindN+, DNAPred, DNAgenie, and DisoRDPbind produce relatively low amounts of cross-predictions for the structure-annotated proteins.
Most importantly, our empirical results suggest that none of the considered tools offers predictions that are highly accurate across the disorder-annotated and structure-annotated proteins, motivating the development of a novel meta-predictor.

We conceptualize, design, implement, empirically validate and deploy the hybridDBRpred meta-model that combines predictions generated by three arguably most accurate current predictors. Our solution uses a well-designed collection of sequence-derived features and the deep transformer network that we train with an advanced loss function to produce accurate predictions of DBRs. We demonstrate empirically that these innovative design choices produce substantial improvements to the predictive quality of hybridDBRpred. Overall, hybridDBRpred provides balanced and high levels of predictive quality across the two annotation types and generates relatively low levels of cross-predictions and over-predictions. We also show that our deep learning-based meta-predictor is statistically more accurate than each of the ten tools as well as baseline meta-predictors that rely on simple averaging and logistic regression. We implement hybridDBRpred as a convenient web server that is freely available at http://biomine.cs.vcu.edu/servers/hybridDBRpred/. We also provide the corresponding source code at https://github.com/jianzhang-xynu/hybridDBRpred.

Figure 1. The topology of the hybridDBRpred predictor.

Table 1. Comparison of surveys that cover sequence-based predictors of DBRs

ods and some of which interact with the non-DNA partners, such as RNA and proteins. We use these annotations to comparatively assess the cross-predictions. Finally, driven by results of this analysis, we design, assess and release a new deep neural network-based meta-predictor, hybridDBRpred (hybrid network for DNA-Binding Residue prediction), that provides accurate predictions for the structure- and disorder-annotated proteins.

Table 2. Comparison of the ten structure-trained and disorder-trained predictors of binding residues and the new hybridDBRpred meta-predictor using the sampled test dataset